Change “your name” in the YAML header above to your name.
As usual, enter the examples in code chunks and run them, unless told otherwise.
Read R4ds Chapter 10: Tibbles, sections 1-3.
Load the tidyverse package.
library(tidyverse)
[30m-- [1mAttaching packages[22m --------------------------------------- tidyverse 1.2.1 --[39m
[30m[32mv[30m [34mggplot2[30m 3.1.0 [32mv[30m [34mpurrr [30m 0.2.5
[32mv[30m [34mtibble [30m 1.4.2 [32mv[30m [34mdplyr [30m 0.7.8
[32mv[30m [34mtidyr [30m 0.8.2 [32mv[30m [34mstringr[30m 1.3.1
[32mv[30m [34mreadr [30m 1.3.0 [32mv[30m [34mforcats[30m 0.3.0[39m
[30m-- [1mConflicts[22m ------------------------------------------ tidyverse_conflicts() --
[31mx[30m [34mdplyr[30m::[32mfilter()[30m masks [34mstats[30m::filter()
[31mx[30m [34mdplyr[30m::[32mlag()[30m masks [34mstats[30m::lag()[39m
Enter your code chunks for Section 10.2 here.
Describe what each chunk code does.
Example 1:
as_tibble(iris)
Example 1 plots the iris data.
Example 2:
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
Example 2 takes the inputs for x, y and z and prints out the answer.
Example 3:
tb <- tibble(
`:)` = "smile",
` ` = "space",
`2000` = "number"
)
tb
Example 3 shows how non valid R variable names can be used by tibble.
Example 4:
tribble(
~x, ~y, ~z,
#--|--|----
"a", 2, 3.6,
"b", 1, 8.5
)
Example 4 shows the use of tribble to create code based graphs.
Enter your code chunks for Section 10.3 here.
Describe what each chunk code does.
Example 1:
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
)
Example 1 shows the basic way tibble displays large data in a helpful manner.
Example 2:
df <- tibble(
x = runif(5),
y = rnorm(5)
)
df$x
[1] 0.55262037 0.60017835 0.47087449 0.05166474 0.79727797
df[["x"]]
[1] 0.55262037 0.60017835 0.47087449 0.05166474 0.79727797
df[[1]]
[1] 0.55262037 0.60017835 0.47087449 0.05166474 0.79727797
Example 2 shows how $ and [[]] can be used to pull out specific data from a data.
Answer the questions completely. Use code chunks, text, or both, as necessary.
1: How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame). Identify at least two ways to tell if an object is a tibble. Hint: What does as_tibble() do? What does class() do? What does str() do? Answer for question 1: Tibbles print out the first 10 observations. Columns in tibbles print out what type of data they are, a feature borrowed from str().
2: Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviours cause you frustration? Answer for question 2: df$x returned the entire “xyz” section which could return unwanted data in certain circumstances.
df <- data.frame(abc = 1, xyz = "a")
df$x
[1] a
Levels: a
df[, "xyz"]
[1] a
Levels: a
df[, c("abc", "xyz")]
str(df)
'data.frame': 1 obs. of 2 variables:
$ abc: num 1
$ xyz: Factor w/ 1 level "a": 1
tibble <- as_tibble(df)
tibble$x
Unknown or uninitialised column: 'x'.
NULL
tibble[, "xyz"]
tibble[, c("abc", "xyz")]
Read R4ds Chapter 11: Data Import, sections 1, 2, and 5.
Nothing to do here unless you took a break and need to reload tidyverse.
Do not run the first code chunk of this section, which begins with heights <- read_csv("data/heights.csv"). You do not have that data file so the code will not run.
Enter and run the remaining chunks in this section.
Example 1:
read_csv("a,b,c
1,2,3
4,5,6")
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
Example 1 plots a chart with column names taken from the first line.
Example 2:
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
Example 2 shows the use of the skip command or “#” to skip a set number of lines so you can exclude potentially unwanted metadata.
Example 3:
read_csv("1,2,3\n4,5,6", col_names = FALSE)
Example 3 lets read_csv that the data doesn’t have column names.
Example 4:
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
Example 4 assigns names to the columns.
Example 5:
read_csv("a,b,c\n1,2,.", na = ".")
Example 5 shows how the na commmand can replace missing data points.
1: What function would you use to read a file where fields were separated with “|”? Answer for question 1: read_delim would be the function used.
2: (This question is modified from the text.) Finish the two lines of read_delim code so that the first one would read a comma-separated file and the second would read a tab-separated file. You only need to worry about the delimiter. Do not worry about other arguments. Replace the dots in each line with the rest of your code.
file <- read_delim("file.csv", delim = ",")
file <- read_delim("file.csv", delim = "\t")
3: What are the two most important arguments to read_fwf()? Why? Answer to question 3: col_positions which specifies the column’s beginning and end, and col_types which parse out the columns.
4: Skip this question
5: Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
read_csv("a,b\n1,2,3\n4,5,6")
2 parsing failures.
row col expected actual file
1 -- 2 columns 3 columns literal data
2 -- 2 columns 3 columns literal data
read_csv("a,b,c\n1,2\n1,2,3,4")
2 parsing failures.
row col expected actual file
1 -- 3 columns 2 columns literal data
2 -- 3 columns 4 columns literal data
read_csv("a,b\n\"1")
2 parsing failures.
row col expected actual file
1 a closing quote at end of file literal data
1 -- 2 columns 1 columns literal data
read_csv("a,b\n1,2\na,b")
read_csv("a;b\n1;3")
Answer for question 5: 1. The data has 3 columns worth of data, but only specifies 2 columns in the header. As such, the last column of data is dropped. 2. There are 3 columns which do not match the amount of data, the first row is listed as missing an entry and the second row has a entry dropped. 3. There are 2 columns but the data only consists of 1 entry, the second column is then listed as empty. 4. Operates fine but has data that is entered the same as the column names which is confusing. 5. Data is seperated by “;” so instead of seperating into 2 columns it forms 1 column.
Just read this section. You may find it helpful in the future to save a data file to your hard drive. It is basically the same format as reading a file, except that you must specify the data object to save, in addition to the path and file name.
Read R4ds Chapter 18: Pipes, sections 1-3.
Nothing to do otherwise for this chapter. Is this easy or what?
Note: Trying using pipes for all of the remaining examples. That will help you understand them.
Read R4ds Chapter 12: Tidy Data, sections 1-3, 7.
Nothing to do here unless you took a break and need to reload the tidyverse.
Study Figure 12.1 and relate the diagram to the three rules listed just above them. Relate that back to the example I gave you in the notes. Bear this in mind as you make data tidy in the second part of this assignment.
You do not have to run any of the examples in this section.
Read and run the examples through section 12.3.1 (gathering), including the example with left_join(). We’ll cover joins later.
Example 1:
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
Example 1 gathers 1999 and 2000 into a proper column and tidys the data.
Example 2:
table4b %>%
gather(`1999`, `2000`, key = "year", value = "population")
Example 2 does the same thing as 1 to a different data set.
Example 3:
tidy4a <- table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
gather(`1999`, `2000`, key = "year", value = "population")
dplyr::left_join(tidy4a, tidy4b)
Joining, by = c("country", "year")
2: Why does this code fail? Fix it so it works. Answer to question 2: Lack of back ticks around the year entries.
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
#> Error in inds_combine(.vars, ind_list): Position must be between 0 and n
That is all for Chapter 12. On to the last chapter.
Read R4ds Chapter 5: Data Transformation, sections 1-4.
Time to get small.
Load the necessary libraries. As usual, type the examples into and run the code chunks.
install.packages("nycflights13")
Error in install.packages : Updating loaded packages
Library
library(nycflights13)
flights
filter()Study Figure 5.1 carefully. Once you learn the &, |, and ! logic, you will find them to be very powerful tools.
Example 1:
dplyr::filter(flights, month==1, day==2)
Example 2:
jan1 <- dplyr::filter(flights, month==1, day==1)
Example 3:
(dec25 <- dplyr::filter(flights, month == 12, day == 25))
Example 4:
near(sqrt(2) ^ 2, 2)
[1] TRUE
near(1 / 49 * 49, 1)
[1] TRUE
Example 5:
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
1.1: Find all flights with a delay of 2 hours or more. Answer for question 1.1:
filter(flights, dep_delay >= 120 & arr_delay >= 120)
1.2: Flew to Houston (IAH or HOU) Answer to question 1.2:
filter(flights, dest == "HOU"| dest == "IAH")
1.3: Were operated by United (UA), American (AA), or Delta (DL). Answer for question 1.3:
filter(flights, carrier == "UA" | carrier == "AA" | carrier == "DL")
1.4: Departed in summer (July, August, and September). Answer for question 1.4:
filter(flights, month == 7| month == 8| month == 9)
1.5: Arrived more than two hours late, but didn’t leave late. Answer for question 1.5:
filter(flights, arr_delay > 120, dep_delay == 0)
1.6: Were delayed by at least an hour, but made up over 30 minutes in flight. This is a tricky one. Do your best. Answer for question 1.6:
filter(flights, dep_delay >= 60, arr_delay <= 30)
1.7: Departed between midnight and 6am (inclusive) Answer for question 1.7:
filter(flights, dep_time >= 0000, dep_time <= 0600)
2: Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges? Answer for question 2: between tells if values in a numeric vector fall in a specified range. It could’ve been used to simplify 1.7.
?between
3: How many flights have a missing dep_time? What other variables are missing? What might these rows represent? Answer for question 3: 8,255 flights have a missing dep_time. dep_delay, arr_time, arr_delay and air_time are also missing. These represent that the time data of the flight weren’t recorded for those entries.
filter(flights, is.na(dep_time))
4: Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!) Answer to question 4: NA is still considered as a variable, as such it can operate in the afforementioned equations.
Note: For some context, see this thread
arrange()Example 1:
arrange(flights, year, month, day)
Example 1 arranges the columns in the specific order.
Example 2:
arrange(flights, desc(dep_delay))
Example 2 arranges dep_delay in descending order.
Example 3:
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
1: How could you use arrange() to sort all missing values to the start? (Hint: use is.na()). Note: This one should still have the earliest departure dates after the NAs. Hint: What does desc() do? Answer for question 1:
rowSums(df)
[1] 5 2 NA
arrange(df, desc(is.na(x)))
arrange(df, -(is.na(x)))
2: Sort flights to find the most delayed flights. Find the flights that left earliest.
This question is asking for the flights that were most delayed (left latest after scheduled departure time) and least delayed (left ahead of scheduled time).
arrange(flights, desc(dep_delay))
arrange(flights, desc(-dep_delay))
3: Sort flights to find the fastest flights. Interpret fastest to mean shortest time in the air. Answer to question 3:
arrange(flights, desc(-air_time))
Optional challenge: fastest flight could refer to fastest air speed. Speed is measured in miles per hour but time is minutes. Arrange the data by fastest air speed.
4: Which flights travelled the longest? Which travelled the shortest? Answer to question 4:
arrange(flights, desc(distance))
arrange(flights, desc(-distance))
select()Example 1:
select(flights, year, month, day)
Example 2:
select(flights, -(year:day))
Example 3:
rename(flights, tail_num = tailnum)
Example 4:
select(flights, time_hour, air_time, everything())
1: Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights. Find at least three ways. Answer to question 1:
select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, 4, 6, 7, 9)
select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")
2: What happens if you include the name of a variable multiple times in a select() call? Answer for question 2:
select(flights, dep_time, dep_time)
The repitition is ignored and it selects the variable as normal.
3: What does the one_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay") Answer to question 3: The one_of selects a variable using a character vector, not unquoted variables. Results in an easier time typing out a selction from the example.
4: Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
`select(flights, contains("TIME"))`
Error: object 'select(flights, contains("TIME"))' not found
Answer for question 4: The results don’t surprise me, case is ignorred by contains(). It can be changed with the argument ignore.case = FALSE.